Literature search is arguably one of the most important phases of academic and non-academic research. 
The increase in the number of published papers each year makes manual search inefficient and furthermore 
insufficient. Hence, automated methods such as search engines have been of interest in the last thirty 
years. Unfortunately, these traditional engines use keyword-based approaches to solve the search problem, 
but these approaches are prone to ambiguity and synonymy. On the other hand, bibliographic search tech- 
niques based only on the citation information are not prone to these problems since they do not consider 
textual similarity. For many particular research areas and topics, the amount of knowledge to humankind 
is immense, and obtaining the desired information is as hard as looking for a needle in a haystack. Further- 
more, sometimes, what we are looking for is a set of documents where each one is different than the others, 
but at the same time, as a whole we want them to cover all the important parts of the literature relevant 
to our search. This paper targets the problem of result diversification in citation-based bibliographic search. 
It surveys a set of techniques which aim to find a set of papers with satisfactory quality and diversity. We 
enhance these algorithms with a direction-awareness functionality to allow the users to reach either old, 
well-cited, well-known research papers or recent, less-known ones. We also propose a set of novel techniques 
for a better diversification of the results. All the techniques considered are compared by performing a rigor- 
ous experimentation. The results show that some of the proposed techniques are very successful in practice 
while performing a search in a bibliographic database. 
1. INTRODUCTION 

The academic community has published millions of research papers to date, and the 
number of new papers has been increasing with time. For example, based on DBLP^1, 
computer scientists published 3 times more papers in 2010 than in 2000 (see Figure 1, 
left). With more than one hundred thousand new papers each year, performing a 
complete literature search has become a herculean task. A paper cites 20 other papers on 
average (see Figure 1, right, for the distribution of citations in our data), which means 
that there might be more than a thousand papers that cite or are cited by the papers 
referenced in a research article. Researchers typically rely on manual methods 
to discover new research, such as keyword-based search via search engines, reading 
proceedings of conferences, browsing the publication lists of known experts, or checking the 
reference list of a paper they are interested in. These techniques are time-consuming 
and only allow one to reach a limited set of documents in a reasonable time. Developing 
tools that help researchers find relevant papers they do not know has been of interest 
for the last thirty years. 

Some of the existing approaches and tools for the literature search cannot com- 
pete with some characteristics of today's literature. For example, keyword-based ap- 
proaches suffer from the confusion induced by different names of identical concepts in 
different fields (for instance, a partially ordered set, or poset, is also often called a directed 
acyclic graph, or DAG). Conversely, two different concepts may have the same name in 
different fields (for instance, hybrid is commonly used to specify software hybridiza- 
tion, hardware hybridization, or algorithmic hybridization). These two problems may 
drastically increase the number of suggested but unrelated papers. 




Fig. 1. Number of new papers published each year based on DBLP (left; x-axis: publication year), and 
number of papers with given citation and reference count (right; x-axis: references/citations). 



Since they do not use textual information, bibliographic search techniques based 
only on the citation information do not suffer from the above-mentioned problems [Kessler 
1963; McNee et al. 2002; Small 1973; Lawrence et al. 1999; Liang et al. 2011; Gori and 
Pucci 2006; Lao and Cohen 2010; Li and Willett 2009; Ma et al. 2008]. Furthermore, it 
has been shown that text-based similarity is not sufficient for this task and that most 
of the relevant information is contained within the citation information [Strohman 
et al. 2007]. Besides, it is plausible that there is already a correlation between citation 
similarities and text similarities of the papers [Salton 1963; Peters et al. 1995]. Following 
the idea of using citation information for bibliographic search, we built an efficient 

1 http://www.informatik.uni-trier.de/~ley/db/ statistics based on data acquired in Dec'11 
and effective web service called theadvisor^2 [Kucuktunc et al. 2012b; Kucuktunc et al. 
2012a] based on personalized PageRank. It takes a bibliography file containing a set 
of papers, i.e., seeds, as an input to initiate the search. The algorithms employed by 
theadvisor have the direction-awareness functionality, which allows the user to specify 
her interest in classical or recent papers. Taking this criterion into account, the service 
returns a set of suggested papers ordered with respect to a ranking function. After 
obtaining the results, the user can give positive feedback to the system, and if desired, 
the output set is refined. 

Today, for many particular research areas and topics, the amount of knowledge to 
humankind is immense and reaching the correct information is as hard as looking for 
a needle in a haystack. Furthermore, sometimes, what we are looking for is a set of 
results where each one is different than the others, but at the same time, as a whole 
we want them to cover all the important parts of the literature relevant to our search. 
Hence, diversifying the results of the search process is an important task to increase 
the amount of information one can reach via an automated search tool. There exist 
many recommender systems, such as Google web search, which personalize the output 
with respect to user query/history. It is well known that for several applications, 
personalization can be an important limitation while reaching all the relevant 
information [Drosou and Pitoura 2010], and diversification can be used to increase the 
coverage of the results and hence, user satisfaction [Agrawal et al. 2009; Clarke et al. 
2008; Mei et al. 2010; Gollapudi and Sharma 2009]. 

In this work, we target the bibliographic search problem and diversify the results 
of the citation/paper recommendation process with the following objectives in mind: 
(1) the direction awareness property is kept, (2) the method should be efficient enough 
to be computable in real time, and (3) the results are relevant to the query and also 
diverse within the set. The contributions of this work are as follows: 



— We survey various random walk-based diversity methods (i.e., GRASSHOPPER [Zhu 
et al. 2007], DivRank [Mei et al. 2010] variants, and DRAGON [Tong et al. 2011]) 
and relevancy/diversity measures. 

— We enhance these algorithms with the direction awareness property. 

— We propose new algorithms based on vertex selection (IL1, IL2, LM, $\gamma$-RLM) and 
query refinement (GSPARSE, FEED). 

— We perform a rigorous set of experiments with various evaluation criteria and show 
that the proposed $\gamma$-RLM algorithm is suitable in practice for real-time diverse 
bibliographic search. 

All of the algorithms in this paper are implemented and tested within theadvisor, and 
the best one ($\gamma$-RLM) will be used to power the system in the very near future. 

2. BACKGROUND 

2.1. Graph-based Citation Recommendation 

Citation-analysis-based paper recommendation has been a popular problem since 
the '60s. There are methods that only take local neighbors (i.e., citations and references) 
into account, e.g., bibliographic coupling [Kessler 1963], cocitation [Small 
1973], and CCIDF [Lawrence et al. 1999]. Recent studies, however, employ graph-based 
algorithms, such as Katz [Liben-Nowell and Kleinberg 2007], random walk with 
restarts [Pan et al. 2004], or the well-known PageRank (PR) algorithm [Brin and Page 
1998], to investigate the whole citation network. PaperRank [Gori and Pucci 2006], 
ArticleRank [Li and Willett 2009], and Katz distance-based methods [Strohman et al. 
2007] are typical examples. 

2 http://theadvisor.osu.edu/ 

Ranking with Personalized PageRank (PPR) is a good way to find relevance scores of 
the papers. However, these algorithms treat the citations and references in the same 
way. This may not lead the researcher to recent and relevant papers if she is more 
interested in those. Old and well cited papers have an advantage with respect to the 
relevance scores since they usually have more edges in the graph. Hence the graph 
tends to have more and shorter paths from the seed papers to old papers. We previously 
defined the class of direction-aware algorithms based on personalized PageRank, 
which can be tuned to reach a variety of citation patterns, allowing them to match 
the patterns of recent or traditional documents [Kucuktunc et al. 2012b]. We give the 
details of PageRank-based algorithms in Section 2.4. 



2.2. Result Diversification on Graphs 

The importance of diversity in ranking has been discussed in various data mining 
fields, including text retrieval [Carbonell and Goldstein 1998], recommender systems 
[Ziegler et al. 2005], online shopping [Vee et al. 2008], and web search [Clarke 
et al. 2008]. The topic is often addressed as a multi-objective optimization problem 
[Drosou and Pitoura 2010], which is shown to be NP-hard [Carterette 2009], and, 
therefore, some greedy [Agrawal et al. 2009; Haritsa 2009] and clustering-based [Liu 
and Jagadish 2009] heuristics were proposed. Although there is no single definition of 
diversity, different objective functions and axioms expected to be satisfied by a 
diversification system were discussed in [Gollapudi and Sharma 2009]. 

Diversification of the results of random walk-based methods on graphs has only 
attracted attention recently. GRASSHOPPER is one of the earlier algorithms and 
addresses diversified ranking on graphs by vertex selection with absorbing random 
walks [Zhu et al. 2007]. It greedily selects the highest ranked vertex at each step 
and turns it into a sink for the next steps. Since the algorithm has a high time 
complexity, it is not scalable to large graphs. DivRank, on the other hand, combines the 
greedy vertex selection process in one unified step with the vertex-reinforced random 
walk (VRRW) model [Mei et al. 2010]. This algorithm updates the transition matrix 
at each iteration with respect to the current or cumulative ranks of the nodes to 
introduce a rich-gets-richer mechanism to the ranking. But since the method updates the 
full transition matrix at each iteration, more iterations are needed for convergence; 
therefore, the computation cost increases. The shortcomings of those techniques were 
discussed in [Li and Yu 2011] in detail. [Tong et al. 2011] formalizes the problem from 
an optimization viewpoint, proposes the goodness measure to combine relevancy and 
diversity, and presents a near-optimal algorithm called DRAGON. These algorithms are 
further discussed in Section 3. 



Coverage-based methods (such as the one in [Li and Yu 2011]) are also interesting 
for diversification purposes; however, they do not preserve the direction awareness 
property of the ranking function. Since our aim is to diversify the results of our paper 
recommendation service, we omitted the results of those coverage-based methods in 
our experiments. 

2.3. Problem Definition 

Let G = (V, E) be a directed citation graph where $V = \{v_1, \ldots, v_n\}$ is the vertex set and 
E, the edge set, contains an edge (u, v) if paper u cites paper v. Let $\delta^+(u) = |\{(u, v) \in E\}|$ 
and $\delta^-(u) = |\{(v, u) \in E\}|$ be the number of references of and citations to paper 
u, respectively. We define the weight of an edge, w(u, v), based on how important the 
citation is or how many times this paper is cited; however, for the sake of simplicity 
we take w(u, v) = 1 for all $(u, v) \in E$. Therefore, the nonsymmetric matrix $W : V \times V$ 
becomes a 0-1 matrix. A summary of the notation used throughout the paper is given 
in Table I. 

We target the problem of paper recommendation assuming that the researcher has 
already collected a list of papers of interest [Kucuktunc et al. 2012b]. Therefore, the 
objective is to return papers that extend that list: given a set of m seed papers 
$M = \{p_1, \ldots, p_m\}$ s.t. $M \subseteq V$, and a parameter k, return the top-k papers which are relevant to 
the ones in M. With the diversity objective in mind, we want the recommended papers to 
be not only relevant to the query set M, but also to cover different topics around the 
query set. 



Table I. Notation

Group        Symbol                             Definition
-----------  ---------------------------------  ----------------------------------------------------------
Graph        G = (V, E)                         directed citation graph with V nodes and E edges
             G' = (V, E')                       undirected graph based on G
             n                                  |V|, number of vertices
             W                                  weights of the edges
             w(u, v)                            weight of the edge from u to v
             $\delta^-(v)$, $\delta^+(v)$       number of incoming or outgoing edges of v
             $\delta(v)$                        $\delta^-(v) + \delta^+(v)$, number of neighbors of vertex v
             d(u, v)                            shortest distance between u and v in G
             $N_\ell(S)$                        $\ell$-step expansion set of $S \subseteq V$
Query        M                                  a set of seed papers $\{p_1, \ldots, p_m\}$, $M \subseteq V$
             m                                  |M|, number of seed papers
             k                                  required number of results, k < n
             R                                  a set of recommended vertices, $R \subseteq V$ and |R| = k
             d                                  damping factor of random walk with restart, 0 < d < 1
             $\kappa$                           direction-awareness parameter, $0 \leq \kappa \leq 1$
Random walk  $p^*$                              prior distribution for personalized PageRank
             t                                  iteration, or timestamp
             $p_t$                              probability vector of being on a state at iteration t
             $\eta_t$                           vector of number of visits at iteration t
             A                                  structurally-symmetric n x n transition matrix based on G
             A'                                 structurally-symmetric n x n transition matrix based on G'
             P                                  n x n transition matrix for an iterative random walk
             $\pi$                              $p_\infty$, stationary probability vector, $\sum_v \pi(v) = 1$
             $\epsilon$                         convergence threshold
Measures     S                                  a subset of vertices, $S = \{s_1, \ldots, s_t\}$, $S \subseteq V$
             $\hat{S}$                          top-k results, $\hat{S} = \arg\max_{S' \subseteq V, |S'| = k} \sum_{v \in S'} \pi_v$
             rel(S)                             normalized relevance of the set
             diff(S)                            difference ratio of two sets
             use(S)                             usefulness of the set
             $dens_\ell(S)$                     $\ell$-step graph density
             $\sigma_\ell(S)$                   $\ell$-expansion ratio

2.4. PageRank, Personalized PageRank, and direction-aware Personalized PageRank 

Let G' = (V, E') be the undirected version of the citation graph, p(u, v) be the transition 
probability between two nodes (states), and d be the damping factor. 

2.4.1. PageRank (PR). [Brin and Page 1998] 

We can define a random walk on G' arising from following the edges (links) with 
equal probability and a random restart at an arbitrary vertex with (1 − d) teleportation 
probability. The probability distribution over the states follows the discrete time 
evolution equation: 

$p_{t+1} = P p_t$, (1) 

where $p_t$ is the vector of probabilities of being on a certain state at iteration t, and P 
is the transition matrix defined as: 

$P(u, v) = (1 - d)\frac{1}{n} + d\,\frac{1}{\delta(v)}$ if $(u, v) \in E'$, and $P(u, v) = (1 - d)\frac{1}{n}$ otherwise. (2) 

If the network is ergodic (i.e., irreducible and aperiodic), Eq. 1 converges to a 
stationary distribution $\pi = P\pi$ after a number of iterations. The final distribution 
$\pi$ gives the PageRank scores of the nodes based only on centrality. 

In practice, the algorithm is said to have converged when the probabilities of the 
papers are stable, i.e., when the process is in a steady state. Let 

$\Delta_t = (p_t(1) - p_{t-1}(1), \ldots, p_t(n) - p_{t-1}(n))$ (3) 

be the difference vector. We say that the process is in the steady state when the L2 
norm of $\Delta_t$ is smaller than a given value $\epsilon$. That is, 

$\|\Delta_t\|_2 = \sqrt{\sum_{i \in V} (p_t(i) - p_{t-1}(i))^2} < \epsilon$. (4) 



2.4.2. Personalized PageRank (PPR). [Haveliwala 2002] 



In our problem, a set of nodes M is given as a query, and we want the random 
walks to teleport to only those given nodes. Let us define a prior distribution $p^*$ such 
that: 

$p^*(v) = \frac{1}{m}$ if $v \in M$, and $p^*(v) = 0$ otherwise. (5) 

If we substitute the two (1/n)s in Eq. 2 with $p^*$, we get a variant of PageRank, 
which is known as personalized PageRank or topic-sensitive PageRank [Haveliwala 
2002]. PaperRank [Gori and Pucci 2006] applies the personalized PageRank method to 
the undirected citation graph G'. 
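
The iteration of Eq. 1, the stopping rule of Eq. 4, and the prior of Eq. 5 can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in theadvisor: the function name `pagerank`, the adjacency-list input, the default parameter values, and the handling of isolated vertices (their mass is returned to the prior) are all assumptions made for this example.

```python
import numpy as np

def pagerank(adj, seeds=None, d=0.85, eps=1e-10, max_iter=1000):
    """Power iteration on an undirected graph given as adjacency lists.

    adj[v] lists the neighbors of vertex v (vertices are 0..n-1).
    With seeds=None the teleport target is uniform (plain PageRank);
    otherwise it is the prior p* of Eq. 5 (personalized PageRank).
    """
    n = len(adj)
    prior = np.full(n, 1.0 / n)
    if seeds:
        prior = np.zeros(n)
        prior[list(seeds)] = 1.0 / len(seeds)

    p = prior.copy()
    for _ in range(max_iter):
        nxt = (1.0 - d) * prior              # teleportation part
        for v, neighbors in enumerate(adj):
            if neighbors:
                share = d * p[v] / len(neighbors)
                for u in neighbors:          # follow each edge with equal probability
                    nxt[u] += share
            else:
                nxt += d * p[v] * prior      # assumption: isolated vertex restarts
        delta = np.linalg.norm(nxt - p)      # L2 norm of the difference vector (Eq. 4)
        p = nxt
        if delta < eps:
            break
    return p

# Path graph 0 - 1 - 2 - 3, queried with the single seed paper 0.
path = [[1], [0, 2], [1, 3], [2]]
scores = pagerank(path, seeds=[0])
```

On this path graph the personalized scores decrease with the distance from the seed beyond its direct neighbor, whereas the non-personalized scores are symmetric around the middle of the path.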



2.4.3. Direction-aware Random Walk with Restart (DaRWR). [Kucuktunc et al. 2012b] 

We defined a direction awareness parameter $\kappa \in [0, 1]$ to obtain more recent or 
traditional results in the top-k documents [Kucuktunc et al. 2012b]. Given a query with 
inputs k, a seed paper set M, damping factor d, and direction awareness parameter 
$\kappa$, Direction-aware Random Walk with Restart (DaRWR) computes the steady-state 
probability vector $\pi$. The ranks of papers after iteration t are computed with the 
following linear equation: 

$p_{t+1} = p^* + A p_t$, (6) 

where $p^*$ is an n x 1 restart probability vector calculated with Eq. 5, and A is a 
structurally-symmetric n x n matrix of edge weights, such that 

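$A(i, j) = \frac{d\,\kappa}{\delta^-(j)}$ if $(i, j) \in E$; $A(i, j) = \frac{d(1 - \kappa)}{\delta^+(j)}$ if $(j, i) \in E$; and $A(i, j) = 0$ otherwise. (7) 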



Note that the transition matrix P of random walk-based methods is built using A and $p^*$; 
however, the edge weights in rows can be stored and read more efficiently with A in 
practice [Kucuktunc et al. 2012a]. 
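
To make the role of the direction awareness parameter concrete, the sketch below iterates Eq. 6 on a small directed citation graph. It is an illustration under stated assumptions, not the authors' implementation: the function name `darwr`, the edge-list input, and the default values are hypothetical; each paper is assumed to pass a `d * kappa` share of its score to the papers citing it and a `d * (1 - kappa)` share to the papers it references; and the final normalization is added only so that the scores sum to one.

```python
import numpy as np

def darwr(edges, n, seeds, kappa=0.75, d=0.9, eps=1e-10, max_iter=1000):
    """Direction-aware random walk with restart on a directed citation graph.

    edges contains pairs (u, v) meaning "paper u cites paper v".
    """
    refs = [[] for _ in range(n)]    # refs[u]: papers referenced by u
    cites = [[] for _ in range(n)]   # cites[v]: papers that cite v
    for u, v in edges:
        refs[u].append(v)
        cites[v].append(u)

    prior = np.zeros(n)
    prior[list(seeds)] = 1.0 / len(seeds)

    p = prior.copy()
    for _ in range(max_iter):
        nxt = prior.copy()                       # p* + A p_t  (Eq. 6)
        for j in range(n):
            if refs[j]:                          # flow towards older papers
                share = d * (1.0 - kappa) * p[j] / len(refs[j])
                for i in refs[j]:
                    nxt[i] += share
            if cites[j]:                         # flow towards more recent papers
                share = d * kappa * p[j] / len(cites[j])
                for i in cites[j]:
                    nxt[i] += share
        delta = np.linalg.norm(nxt - p)
        p = nxt
        if delta < eps:
            break
    return p / p.sum()                           # normalization is an added convenience

# Paper 1 is the seed; it cites the older paper 0 and is cited by the newer paper 2.
toy_edges = [(1, 0), (2, 1)]
recent_first = darwr(toy_edges, 3, seeds=[1], kappa=0.9)
classic_first = darwr(toy_edges, 3, seeds=[1], kappa=0.1)
```

In this toy graph a value of `kappa` close to 1 ranks the citing paper above the referenced one, and a value close to 0 reverses the order.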

Figure 2 shows that the direction-awareness parameter $\kappa$ can be adjusted to reach 
papers from different years, with a range from the late 1980s to 2010, for almost all values 
of d. In our online service, the parameter $\kappa$ can be set to a value of the user's preference. It 
allows the user to obtain recent papers by setting $\kappa$ close to 1 or to find older papers 
by setting $\kappa$ close to 0. 

Fig. 2. Average publication year of top-10 recommendations by DaRWR based on d and $\kappa$. 

3. DIVERSIFICATION METHODS 

We classify the diversification methods for the paper recommendation problem based 
on whether the algorithm needs to rank the papers only once or multiple times. The 
first set of algorithms run a ranking function (e.g., PPR, DaRWR, etc.) once and select 
a number of vertices to find a diverse result set. The algorithms in the second set run 
the ranking function k times to select each result, and refine the search with some 
changes at each step. Although the former class of algorithms is preferred for practical 
use, these algorithms may not be able to reach the intended diversity levels due to the highly 
greedy nature of the vertex selection process. 

3.1. Diversification by vertex selection 

The following approaches are used after getting the direction-aware relevancy (pres- 
tige) rankings of the vertices for a given set of seed nodes. The ranking function is 
selected as DaRWR with parameters ($\kappa$, d). 



3.1.1. DivRank: Vertex-reinforced random walks. [Mei et al. 2010] 

For the random walk-based methods mentioned so far, the probabilities in the transition 
matrix P do not change over the iterations. Using a variant of random walks, 
called vertex-reinforced random walks (VRRW) [Pemantle 1992], DivRank adjusts the 
transition matrix based on the number of visits to the vertices so far [Mei et al. 2010]. 
The original DivRank assumes that there is always an organic link for all the nodes 
returning back to the node itself, which is followed with probability $(1 - \alpha)$: 

$p_0(u, v) = \alpha\,\frac{w(u, v)}{\delta(u)}$ if $u \neq v$, and $p_0(u, v) = 1 - \alpha$ otherwise, (8) 

where w(u, v) is equal to 1 for $(u, v) \in E'$, and 0 otherwise. The transition matrix $P_t$ at 
iteration t is computed with 

$P_t(u, v) = (1 - d)\,p^*(v) + d\,\frac{p_0(u, v)\,\eta_t(v)}{\sum_{z \in V} p_0(u, z)\,\eta_t(z)}$, (9) 

where $\eta_t(v)$ is the number of visits of vertex v. It ensures that the highly ranked nodes 
collect more value over the iterations, resulting in the so-called rich-gets-richer 
mechanism. 
In summary, for each iteration of the defined vertex-reinforced random walks, the 
transition probabilities from a vertex u to its neighbors are adjusted by the number 
of times they are visited up to that point, $\eta_t(v)$. Therefore, u gives a high portion of 
its rank to the frequently visited neighbors. Since the tracking of $\eta_t$ is nontrivial, the 
authors propose to estimate it and provide two different estimation models. One way 
is to employ the cumulative ranks to estimate $\eta_t$ as 

$E[\eta_t(v)] \propto \sum_{i=0}^{t} p_i(v)$, (10) 

and since the ranks will converge after a sufficient number of iterations, it can also be 
estimated with pointwise ranks: 

$E[\eta_t(v)] \propto p_t(v)$. (11) 

While adapting DivRank to our directional problem, we identified two problems: 
first, the initial ranks of all nodes should be set to a nonzero value; otherwise, the 
ranks cannot be distributed with Eq. 9 for both pointwise and cumulative estimation 
of $\eta_t$. Therefore, we set $p_0(v) = 1/n$ for all $v \in V$. Second, an organic link returning back 
to the node itself enables the node to preserve its rank. This is problematic since $p^*$ is only 
set for seed papers, and they tend to get richer over time. However, our objective is to 
distribute the probabilities over other nodes to get a meaningful ranking. We solved 
this problem by removing the organic links of seed papers, hence distributing all of 
their ranks towards their neighbors instead of only $\alpha$ of them. 

With the modifications above, we propose the direction-aware DivRank algorithm: 

$p_0(u, v) = 0$, if $u \in M$, $u = v$; 
$p_0(u, v) = \frac{1 - \kappa}{\delta^+(u)}$, if $u \in M$, $u \neq v$, $(u, v) \in E$; 
$p_0(u, v) = \frac{\kappa}{\delta^-(u)}$, if $u \in M$, $u \neq v$, $(v, u) \in E$; 
$p_0(u, v) = 1 - \alpha$, if $u \notin M$, $u = v$; 
$p_0(u, v) = \alpha\,\frac{1 - \kappa}{\delta^+(u)}$, if $u \notin M$, $u \neq v$, $(u, v) \in E$; 
$p_0(u, v) = \alpha\,\frac{\kappa}{\delta^-(u)}$, if $u \notin M$, $u \neq v$, $(v, u) \in E$; (12) 

where $\kappa$ is the direction awareness parameter. $p_0$ in Eq. 12 can be directly used in 
Eq. 9. Depending on whether the estimation method is cumulative or pointwise, 
we refer to direction-aware cumulative DivRank as CDivRank, and direction-aware 
pointwise DivRank as PDivRank, respectively. 
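
The reinforcement step of Eq. 9 and the two estimates of Eqs. 10 and 11 can be illustrated with a small dense-matrix sketch. This is an assumption-laden illustration, not the authors' code: the function name `divrank`, the dense `p0` input (row u holds the organic transition probabilities out of u, built from Eq. 8 or Eq. 12), and the default values are hypothetical, and a dense matrix would not scale to a real citation graph.

```python
import numpy as np

def divrank(p0, prior, d=0.9, cumulative=False, eps=1e-8, max_iter=500):
    """Vertex-reinforced random walk with pointwise or cumulative visit estimates.

    p0[u, v]: organic transition probability from u to v.
    prior:    restart distribution p* (sums to one).
    """
    n = len(prior)
    p = np.full(n, 1.0 / n)          # nonzero initial ranks, as the text requires
    visits = p.copy()                # running estimate of eta_t
    for _ in range(max_iter):
        weighted = p0 * visits[np.newaxis, :]            # p0(u, v) * eta_t(v)
        row_sum = weighted.sum(axis=1, keepdims=True)    # denominator of Eq. 9
        reinforced = np.divide(weighted, row_sum,
                               out=np.zeros_like(weighted),
                               where=row_sum > 0)
        P = (1.0 - d) * prior[np.newaxis, :] + d * reinforced
        nxt = p @ P                  # one step of the walk under P_t
        nxt /= nxt.sum()             # guards against rows without outgoing mass
        delta = np.linalg.norm(nxt - p)
        p = nxt
        visits = visits + p if cumulative else p         # Eq. 10 vs. Eq. 11
        if delta < eps:
            break
    return p
```

Passing `cumulative=True` corresponds to the cumulative estimate and the default to the pointwise one; the rest of the iteration is shared.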



3.1.2. Dragon: Maximize the goodness measure. [Tong et al. 2011] 

One of many diversity/relevance optimization functions found in the literature is the 
goodness measure [Tong et al. 2011]. It is defined as: 

$f_{G'}(S) = 2 \sum_{i \in S} \pi(i) - d \sum_{i, j \in S} A'(j, i)\,\pi(i) - (1 - d) \sum_{j \in S} \pi(j) \sum_{i \in S} p^*(i)$, (13) 

where A' is the row-normalized adjacency matrix of the graph. The original algorithm 
runs on the undirected citation graph G' and uses a greedy heuristic to find a 
near-optimal solution set. The direction-aware variant of the algorithm, running on the 
directed citation graph and using the DaRWR ranking vector, is referred to as DRAGON. 
Accordingly, the direction-aware goodness measure $f_G(S)$ can be defined as: 

$f_G(S) = 2 \sum_{i \in S} \pi(i) - d\,\kappa \sum_{i, j \in S} A(j, i)\,\pi(i) - d(1 - \kappa) \sum_{i, j \in S} A(i, j)\,\pi(i)$, (14) 

where $\kappa$ is the direction awareness parameter, A is the row-normalized adjacency 
matrix based on the directed graph, and the last part of Eq. 13 is always zero ($\sum_{i \in S} p^*(i) = 0$) 
since seed papers are never included in S. 

3.1.3. IL1, IL2: Ignore $\ell$-step expansion sets. 

Here we present a diversification approach that incrementally adds nodes to the 
recommendation list by ignoring their $\ell$-distance neighbors. In other words, after 
selecting the highest ranked node $r_1$ in the graph, the second highest ranked node $r_2$ is 
skipped if $r_2$ is in the expansion set of $r_1$. The expansion set with 1-distance neighbors 
is defined as $N(S) = S \cup \{v \in (V - S) : \exists u \in S, (u, v) \in E\}$, and the $\ell$-step expansion set 
is defined as 

$N_\ell(S) = S \cup \{v \in (V - S) : \exists u \in S, d(u, v) \leq \ell\}$, (15) 

in [Li and Yu 2011]. 

Based on the parameter $\ell$, the results do not include direct references or citations 
of another recommendation ($\ell = 1$, referred to as IL1), or the papers that can be 
suggested with cocitation or cocoupling methods ($\ell = 2$, referred to as IL2). 
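
The selection rule described above can be sketched as a greedy pass over the ranking combined with a bounded breadth-first search. The names `within_l_steps` and `select_il` are hypothetical, and the sketch assumes that distances are measured on the undirected graph, so that both references and citations of a selected paper are skipped.

```python
from collections import deque

def within_l_steps(adj, source, l):
    """Vertices at distance <= l from source (the l-step expansion set of {source})."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == l:
            continue                      # do not expand beyond l steps
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

def select_il(adj, ranks, k, l=1):
    """Greedy top-k selection that skips the l-step neighborhood of earlier picks."""
    blocked, result = set(), []
    for v in sorted(ranks, key=ranks.get, reverse=True):
        if len(result) == k:
            break
        if v in blocked:
            continue
        result.append(v)
        blocked |= within_l_steps(adj, v, l)
    return result

# Path a - b - c - d - e with ranks decreasing from a to e.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
chain_ranks = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
il1 = select_il(chain, chain_ranks, k=3, l=1)   # ['a', 'c', 'e']
il2 = select_il(chain, chain_ranks, k=2, l=2)   # ['a', 'd']
```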

3.1.4. LM: Choose local maxima. 

Because of the smoothing process of random walks, frequently visited nodes tend to 
increase the ranks of their neighbors [Mei et al. 2010]. Therefore, we argue that computing 
local maxima and returning the top-k of them will guarantee that the nodes returned 
this way are recommended by taking the smoothing process of random walks into 
account. 

Once the ranks are computed, the straightforward approach for getting the local 
maxima is to iterate over each node in the graph and check whether its rank is greater than 
those of all of its neighbors, which in the worst case examines every edge of the graph. 
However, the algorithm runs much faster 
in practice since every rank comparison between two unmarked nodes (either local 
maxima or not) will mark one of them. The LM algorithm is given in Alg. 1. 



ALGORITHM 1: Diversify with local maxima (LM) 

Input: G' = (V, E'): an undirected citation graph 
       $\pi$: ranks or stationary probabilities of the nodes in V 
       k: required number of recommendations 
Output: A list of recommendations S 
L ← empty list of (v, $\pi_v$) 
for each v ∈ V do 
    lm[v] ← LOCALMAX 
for each v ∈ V do 
    if lm[v] = LOCALMAX then 
        for each v' ∈ adj[v] do 
            if $\pi_{v'}$ < $\pi_v$ then 
                lm[v'] ← NOTLOCALMAX 
            else 
                lm[v] ← NOTLOCALMAX 
                break 
        if lm[v] = LOCALMAX then 
            L ← L ∪ {(v, $\pi_v$)} 
SORT(L) w.r.t. $\pi_v$ non-increasing 
S ← L[1..k].v, i.e., top-k vertices 
return S 
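
A direct Python transcription of Alg. 1 may help to see how the marking avoids redundant comparisons. The function name `local_maxima` and the dictionary-based inputs are choices made for this sketch; the control flow follows the pseudocode.

```python
def local_maxima(adj, ranks, k):
    """Return the k highest-ranked local maxima of an undirected graph.

    adj:   dict mapping each vertex to a list of its neighbors
    ranks: dict mapping each vertex to its rank (stationary probability)
    """
    is_max = {v: True for v in adj}        # every vertex starts as a candidate
    found = []
    for v in adj:
        if not is_max[v]:
            continue                       # already ruled out by a neighbor
        for nb in adj[v]:
            if ranks[nb] < ranks[v]:
                is_max[nb] = False         # nb cannot be a local maximum
            else:
                is_max[v] = False          # v is dominated, stop scanning
                break
        if is_max[v]:
            found.append(v)
    found.sort(key=lambda v: ranks[v], reverse=True)
    return found[:k]

# Path a - b - c - d - e: b and d are the only local maxima.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
chain_ranks = {"a": 1, "b": 3, "c": 2, "d": 5, "e": 4}
picked = local_maxima(chain, chain_ranks, k=2)   # ['d', 'b']
```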
3.1.5. $\gamma$-RLM: Choose relaxed local maxima. 

The drawback of diversifying with local maxima is that for large k's (i.e., k > 10), 
the results of the recommendation algorithm are generally no longer related to the 
queried seed papers, but are popular papers in unrelated fields, e.g., a set of well-cited 
physics papers can be returned for a computer science related query. Although this 
might improve the diversity, it hurts the relevancy; hence, the results will no longer 
be useful to the user. 

In order to keep the results within reasonable relevancy to the query and to 
diversify them, we relax the algorithm by incrementally getting local maxima within the 
top-$\gamma k$ results until |S| = k, and removing the selected vertices from the subgraph for 
the next local maxima selection. We refer to this algorithm as parameterized relaxed 
local maxima ($\gamma$-RLM). Note that 1-RLM reduces to DaRWR and $\infty$-RLM reduces to 
LM. The outline of the algorithm is given in Alg. 2. In the experiments, we select $\gamma = k$ 
and refer to this algorithm as k-RLM. Furthermore, we devise another experiment to see 
the effects of $\gamma$ with respect to different measures. 



ALGORITHM 2: Diversify with relaxed local maxima ($\gamma$-RLM) 

Input: G' = (V, E'): an undirected citation graph 
       $\pi$: ranks or stationary probabilities of the nodes in V 
       k: required number of recommendations 
       $\gamma$: relaxation parameter 
Output: A list of recommendations S 
S ← ∅ 
T ← SORT(V) w.r.t. $\pi_i$ non-increasing 
R ← T[1 : $\gamma k$] 
while |S| < k do 
    R' ← FINDLOCALMAXIMAS(G', R, $\pi$) 
    if |R'| > k − |S| then 
        SORT(R') w.r.t. $\pi_i$ non-increasing 
        R' ← R'[1 : (k − |S|)] 
    S ← S ∪ R' 
    R ← R \ R' 
return S 
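
The relaxation in Alg. 2 can be sketched as repeated local maxima selection inside the subgraph induced by the top-ranked candidates. The helper names are hypothetical. As one deliberate deviation from Alg. 1, this sketch treats a vertex as a local maximum when no remaining neighbor has a strictly higher rank, so that ties cannot stall the loop.

```python
def local_maxima_in(adj, ranks, candidates):
    """Local maxima of the subgraph induced by the candidate set."""
    return [v for v in candidates
            if all(ranks[nb] <= ranks[v] for nb in adj[v] if nb in candidates)]

def relaxed_local_maxima(adj, ranks, k, gamma):
    """Pick k vertices by repeated local maxima selection within the top gamma*k."""
    by_rank = sorted(adj, key=lambda v: ranks[v], reverse=True)
    remaining = set(by_rank[:int(gamma * k)])
    selected = []
    while len(selected) < k and remaining:
        maxima = local_maxima_in(adj, ranks, remaining)
        maxima.sort(key=lambda v: ranks[v], reverse=True)
        maxima = maxima[:k - len(selected)]     # keep only as many as still needed
        selected.extend(maxima)
        remaining -= set(maxima)                # removed before the next round
    return selected

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
chain_ranks = {"a": 5, "b": 4, "c": 1, "d": 3, "e": 2}
plain_top = relaxed_local_maxima(chain, chain_ranks, k=2, gamma=1)   # ['a', 'b']
relaxed = relaxed_local_maxima(chain, chain_ranks, k=2, gamma=2)     # ['a', 'd']
```

With `gamma=1` the candidates are exactly the top-k vertices, so the plain ranking is returned; a larger `gamma` lets the lower-ranked but non-adjacent vertex d replace b.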



3.2. Diversification by query refinement 

In this set of diversification algorithms, the ranking function is called multiple times 
while some of the parameters or graph structure are altered between those rankings. 

3.2.1. Grasshopper: Incremental ranking using absorbing random walks. [Zhu et al. 2007] 

GRASSHOPPER [Zhu et al. 2007] is a well-known diversification algorithm which 
ranks the graph multiple times by turning, at each iteration, the highest-ranked vertex 
into a sink node. (A sink node only has a single outgoing edge to itself, so that all its 
rank stays trapped within the sink.) Since the probabilities will be collected by the 
sink vertices when the random walk converges, the method estimates the ranks with 
the number of visits to each node before convergence. 

The original method uses a matrix inversion to find the expected number of visits; 
however, inverting a sparse matrix makes it dense, which is not practical for the large 
and sparse citation graph we are using. Therefore, we estimate the number of visits by 
iteratively computing the cumulative ranks of the nodes with DaRWR. 
3.2.2. GSparse: Incremental ranking by graph sparsification. 

In this algorithm, in contrast with GRASSHOPPER, after executing the ranking function, 
we propose to sparsify the graph by removing all reference and citation edges 
around the highest ranked node and to repeat the process until k nodes are selected 
in total. Note that GRASSHOPPER converts the selected node into a sink node while 
GSparse disconnects it from the graph (see Alg. 3 for details). This way, the graph 
around that node becomes less dense; hence, the nodes will attract fewer visits in a 
random walk. 



ALGORITHM 3: Diversify by graph sparsification (GSPARSE) 

Input: G = (V, E): a directed citation graph 
       M: query, a set of seed nodes 
       k: required number of recommendations 
Output: A list of recommendations S 
S ← ∅ 
G' ← G 
for iter = 1 → k do 
    ranks ← DaRWR(G' = (V', E'), M) 
    v ← argmax(ranks) 
    S ← S ∪ {v} 
    for each v' ∈ adj[v] do 
        E' ← E' \ {(v, v')} 
    V' ← V' \ {v} 
return S 
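
The loop of Alg. 3 is independent of the ranking function, which the sketch below exploits by taking the ranker as an argument. Everything named here is hypothetical: `gsparse` and its signature, the tie-breaking rule, and in particular `degree_rank`, which is only a stand-in so that the example runs on its own; in the paper the ranking is computed with DaRWR. The sketch also assumes that seed papers are never recommended.

```python
def gsparse(edges, seeds, k, rank_fn):
    """Select k vertices, disconnecting each selected vertex before re-ranking.

    edges:   iterable of (u, v) pairs, "u cites v"
    rank_fn: callable (edges, seeds) -> dict of vertex -> score
    """
    live_edges = set(edges)
    selected = []
    for _ in range(k):
        ranks = rank_fn(live_edges, seeds)
        candidates = {v: s for v, s in ranks.items()
                      if v not in seeds and v not in selected}
        if not candidates:
            break                                    # nothing left to recommend
        # Highest score first; ties broken by vertex id for determinism.
        best = min(candidates, key=lambda v: (-candidates[v], v))
        selected.append(best)
        # Remove every reference and citation edge around the selected vertex.
        live_edges = {(u, v) for (u, v) in live_edges if best not in (u, v)}
    return selected

def degree_rank(edges, seeds):
    """Stand-in ranker for the example only: score = number of incident edges."""
    ranks = {}
    for u, v in edges:
        ranks[u] = ranks.get(u, 0) + 1
        ranks[v] = ranks.get(v, 0) + 1
    return ranks

toy_edges = [(1, 0), (2, 1), (3, 1), (4, 1), (5, 0), (6, 5), (7, 5)]
picks = gsparse(toy_edges, seeds={0}, k=2, rank_fn=degree_rank)   # [1, 5]
```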



3.2.3. Feed: Feedback based on graph distance. 

For all the incremental algorithms, once the ranking function returns the most rel- 
evant node, it is most likely closer to some seed nodes than to other ones. In the next 
step, to obtain a different recommendation, one can decrease the importance of the 
closest seed nodes and increase the importance of the farthest seed nodes. 

Following this idea, we argue that the prior probability vector $p^*$ can be adjusted 
with the inverse of the graph distance between the results and the seed papers. Let $p^*_t$ 
be the prior probability of the DaRWR algorithm at step t < k. It is initialized as 

(16) 

and computed for the next iterations with 

(17) 

where $r_i$ is the node selected at iteration $0 \leq i \leq t - 1$, and d(u, v) is the 
shortest distance between u and v. With an efficient implementation, computing 
$median(\{d(r_0, v), \ldots, d(r_{t-1}, v)\})$ is made fast by pre-computing the distances between 
all seed nodes and all vertices in the graph. If min was used when a node is selected 
near a seed node, the function would always return 1, which is undesired. On the other 
hand, if avg was used when the seed nodes are far away from each other, a large 
distance between a selected node and a seed node would affect the process for many iterations. 
median is preferred over these functions because it gives meaningful values for the 
mentioned cases. 
4. EXPERIMENTS 

4.1. Evaluation measures 

The relevancy and diversity of the results should be measured with separate methods 
since the problem is multi-criteria. For both diversity and relevancy parts, we evaluate 
the quality of the results with a number of measures. 

4.1.1. Relevancy measures. 

Normalized relevance: The relevancy score of a set can be computed by comparing 
the original ranking scores of the resulting set with the top-k ranking list [Tong et al. 
2011], defined as 

$rel(S) = \frac{\sum_{v \in S} \pi_v}{\sum_{i=1}^{k} \hat{\pi}_i}$, (18) 

where $\hat{\pi}$ is the sorted ranks in non-increasing order. 

Difference ratio: The results of a diversity method are expected to be somewhat different 
from the top-k relevant set of results since, as our experiments will show, the set of 
nodes recommended by the original DaRWR is not diverse enough. This is expected 
since highly ranked nodes will also increase the ranks of their neighbors [Mei et al. 
2010]. Nevertheless, the original result set has the utmost relevancy. This fact can 
mislead the evaluation of the experimental results. Therefore, we decided to measure 
the difference of each result set from the set of original top-k nodes. The difference 
ratio is computed with 

$diff(S, \hat{S}) = 1 - \frac{|S \cap \hat{S}|}{|S|}$, (19) 

where $\hat{S}$ is the top-k relevant set. 

Usefulness: The original ranking scores $\pi$ actually show the usefulness of the nodes.
Since these scores usually follow a power law distribution, the highly ranked nodes collect
most of the scores, and the contributions of two low-ranked nodes to the rel measure
can be almost the same even though the gap between their positions in the ranking is
huge. Yet, the one with the slightly higher score might be useful where the other might
not, due to this gap. We propose the usefulness metric to capture what percentage of
the results are actually useful regarding their position in the ranking:

$$\mathrm{use}(S) = \frac{|\{v \in S : \pi_v \ge \tilde{\pi}\}|}{|S|}, \tag{20}$$

where $\tilde{\pi} = \hat{\pi}_{10 \times k}$, i.e., the relevancy score of the node with rank
10 × k in the sorted ranking $\hat{\pi}$, for k = |S|, and use(S) gives the ratio of the
recommendations that are within the top 10 × k of the relevancy list.
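As a hedged illustration, the three relevancy measures can be written in a few lines of Python. The function names, the dictionary `scores` (node to ranking score), and the toy power-law-like scores are assumptions of this sketch; the inequality in `usefulness` follows the textual definition (a recommendation is useful if it lies within the top 10 × k).

```python
def normalized_relevance(S, scores):
    """rel(S): score mass of S divided by the score mass of the top-|S| nodes."""
    top = sorted(scores.values(), reverse=True)[:len(S)]
    return sum(scores[v] for v in S) / sum(top)


def difference_ratio(S, top_k):
    """diff(S, S_hat): fraction of S that is not in the top-k relevant set."""
    S = set(S)
    return 1 - len(S & set(top_k)) / len(S)


def usefulness(S, scores, factor=10):
    """use(S): fraction of S whose score reaches that of the node ranked factor * |S|."""
    ranked = sorted(scores.values(), reverse=True)
    cutoff_rank = min(factor * len(S), len(ranked))
    threshold = ranked[cutoff_rank - 1]
    return sum(1 for v in S if scores[v] >= threshold) / len(S)


# Hypothetical scores for 40 nodes; node 0 is the most relevant one.
scores = {i: 1.0 / (i + 1) for i in range(40)}
top_k = [0, 1, 2]
S = [0, 1, 30]  # two top results plus one node from far down the ranking

print(normalized_relevance(S, scores))
print(difference_ratio(S, top_k))
print(usefulness(S, scores))
```

In this toy case node 30 barely lowers rel(S), because its score is tiny either way, but it is not counted by use(S) since it falls just outside the top 10 × k. This is the situation the usefulness measure is meant to expose.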

4.1.2. Diversity measures. 

$\ell$-step graph density: A variant of the graph density measure is the $\ell$-step graph
density [Tong et al. 2011], which takes the effect of indirect neighbors into account. It is
computed with

$$\mathrm{dens}_\ell(S) = \frac{\sum_{u, v \in S,\, u \ne v} d_\ell(u, v)}{|S| \times (|S| - 1)}, \tag{21}$$

where $d_\ell(u, v) = 1$ when $v$ is reachable from $u$ within $\ell$ steps, i.e.,
$d(u, v) \le \ell$, and 0 otherwise. The inverse of $\mathrm{dens}_\ell(S)$ is used for the
evaluation of diversity in [Mei et al. 2010].

$\ell$-expansion ratio: Other diversity measures, the expansion ratio and its variant, the
$\ell$-expansion ratio [Li and Yu 2011], measure the coverage of the graph by the solution
set. They are computed using the previously defined $\ell$-step expansion set $N_\ell(S)$ as

$$\sigma_\ell(S) = \frac{|N_\ell(S)|}{n}. \tag{22}$$
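Both diversity measures reduce to bounded breadth-first searches from the recommended nodes, which the following sketch makes concrete. The function names and the toy graph are assumptions, and the sketch further assumes that the expansion set contains the members of S themselves and that n is the number of vertices in the graph.

```python
from collections import deque


def within_l_steps(adj, source, l):
    """Vertices reachable from source in at most l hops (source included)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == l:  # do not expand beyond l hops
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def l_step_density(adj, S, l):
    """Fraction of ordered pairs (u, v) in S with v reachable from u within l hops."""
    S = list(S)
    if len(S) < 2:
        return 0.0
    hits = 0
    for u in S:
        reach = within_l_steps(adj, u, l)
        hits += sum(1 for v in S if v != u and v in reach)
    return hits / (len(S) * (len(S) - 1))


def l_expansion_ratio(adj, S, l, n):
    """Fraction of the n vertices covered by the l-step neighborhoods of S."""
    covered = set()
    for u in S:
        covered |= within_l_steps(adj, u, l)
    return len(covered) / n


# Hypothetical toy data: an undirected path graph 0-1-2-3-4-5-6.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
S = [0, 1, 6]

print(l_step_density(adj, S, 1))        # only the pair (0, 1) is adjacent
print(l_expansion_ratio(adj, S, 1, 7))  # S and its direct neighbors
print(l_expansion_ratio(adj, S, 2, 7))  # two hops already cover the whole path
```

A lower density and a higher expansion ratio both indicate a more diverse result set under these definitions.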



4.1.3. Other criteria. 
Goodness: Given in Eq. 14.



Average year: The average publication year of the recommendation set. 

Average pairwise distance: The average of the pairwise shortest distances between the
recommendations is a measure of how connected or distant the recommendations are to each
other. It is computed with

$$\mathrm{AVG\_pairwise\_dist}(S) = \frac{\sum_{u, v \in S,\, u \ne v} d(u, v)}{|S| \times (|S| - 1)}. \tag{23}$$

Average MIN distance to M: The distance of the recommendations to the closest seed
paper is a measure of relevance regarding the query:

$$\mathrm{AVG\_min\_dist\_to\_M}(S) = \frac{\sum_{s \in S} \min_{p \in M} d(s, p)}{|S|}. \tag{24}$$
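The two distance-based criteria can be sketched in the same way. The sketch below assumes an undirected, connected toy graph (so that every distance exists and d(s, p) = d(p, s)) and averages over ordered pairs; the function names and the data are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable vertex (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def avg_pairwise_dist(adj, S):
    """Mean shortest distance over all ordered pairs of distinct nodes in S."""
    S = list(S)
    total = 0
    for u in S:
        dist = bfs_distances(adj, u)
        total += sum(dist[v] for v in S if v != u)
    return total / (len(S) * (len(S) - 1))


def avg_min_dist_to_M(adj, S, M):
    """Mean over s in S of the distance from s to its closest seed paper in M."""
    tables = [bfs_distances(adj, p) for p in M]  # one BFS per seed paper
    return sum(min(t[s] for t in tables) for s in S) / len(S)


# Hypothetical toy data: an undirected path graph 0-1-2-3-4-5-6.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
S = [0, 1, 6]  # recommendations
M = [2, 5]     # seed papers of the query

print(avg_pairwise_dist(adj, S))     # (1 + 6 + 5) * 2 / 6
print(avg_min_dist_to_M(adj, S, M))  # (2 + 1 + 1) / 3
```

On a directed citation graph the direction in which distances are measured would have to be fixed explicitly; the sketch sidesteps this by using an undirected graph.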

4.2. Dataset collection and queries 

We retrieved information on 1.9M computer science articles (as of March 2012) from
DBLP [Ley 2009], 740K technical reports on physics, mathematics, and computer science
from arXiv, and 40K publications from the HAL-Inria open access library. These data
are well formatted and disambiguated; however, they contain very little citation
information (fewer than 470K edges). CiteSeer is used to increase the number of
paper-to-paper relations of computer science publications, but most of its data are
automatically generated [Giles et al. 1998] and are often erroneous. We mapped each
document in CiteSeer to at most one document in each dataset using the title information
(with an inverted index on title words and Levenshtein distance) and the publication
years. Using the disjoint sets, we merged the papers and their corresponding metadata
from the four datasets. The papers without any references or incoming citations are
discarded. The final citation graph has about 1M papers and 6M references, and is
currently being used in our service.
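The title-matching step can be illustrated with a minimal sketch. It is not the pipeline used to build the dataset: the record layout, the exact-year filter, the `max_dist` threshold, and the toy titles are all assumptions made for this example. An inverted index on title words narrows the candidates, and the Levenshtein distance picks at most one match.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(records):
    """Inverted index: lower-cased title word -> ids of records containing it."""
    index = defaultdict(set)
    for doc_id, (title, _year) in records.items():
        for word in title.lower().split():
            index[word].add(doc_id)
    return index


def best_match(title, year, records, index, max_dist=3):
    """Return the id of the closest same-year title within max_dist, or None."""
    candidates = set()
    for word in title.lower().split():
        candidates |= index.get(word, set())
    best = None
    for doc_id in sorted(candidates):
        cand_title, cand_year = records[doc_id]
        if cand_year != year:  # assumed rule: publication years must agree
            continue
        d = levenshtein(title.lower(), cand_title.lower())
        if d <= max_dist and (best is None or d < best[1]):
            best = (doc_id, d)
    return best[0] if best else None


# Hypothetical clean records and one noisy, automatically extracted title.
records = {
    "p1": ("Fast ranking on citation graphs", 2012),
    "p2": ("Fast clustering of citation graphs", 2012),
}
index = build_index(records)
print(best_match("fast rankng on citation graph", 2012, records, index))
```

Only records sharing at least one title word with the noisy title are compared, which keeps the number of edit-distance computations small compared with an all-pairs comparison.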

The query set is composed of the actual queries submitted to theadvisor service. We
selected about 240 queries where each query is a set M of paper ids obtained from the
bibliography files submitted by the users of the service who agreed to donate their
queries for research purposes. |M| varies between 1 and 130, with an average of 24.35.

4.3. Results 

We run the algorithms on theadvisor citation graph with varying k values (i.e.,
$k \in \{5, 10, 20, 50, 100\}$) and with the following parameters: $\alpha$ in VRRW (see
Eq. 8) is set to 0.25, as suggested by the authors. For the DaRWR ranking, we use the
default settings of the service, which are d = 0.9 for the damping factor and
$\kappa = 0.75$ to get more recommendations from recent publications. In each run, the
selected algorithm gives a set of recommendations S, where $S \subset V$, $|S| = k$, and
$S \cap M = \emptyset$. The relevancy and diversity measures are computed on S, unless
specified otherwise, and the average of each measure is displayed for different k values.
The standard deviations are omitted from the plots since they are negligible.



3 http://arxiv.org/
4 http://hal.inria.fr/
5 http://citeseerx.ist.psu.edu/

The strategy that we choose to select the best algorithm for our purpose (i.e.,
diversification of the results of theadvisor recommendations) is to eliminate the
algorithms one by one with respect to their results on various relevancy and diversity
measures. This approach might sound quite unorthodox; however, it is probably the best way
since (1) it is not clear whether scoring extremely high or low is better for some of the
measures (e.g., normalized relevancy scores of 0 and 1 are not preferred for a
diversification method since they are the results of the random and top-k algorithms,
respectively), and (2) there was no method that performed the best in all metrics. Although
we eliminate some algorithms, for completeness, we give their results for all the measures.

Fig. 3. Normalized relevance (a) and difference ratio (b) of the result set with respect to top-k results. Note
that rel = 1 and diff = 0 for DaRWR since we compare the result set against itself.



Fig. 3 shows the normalized relevancy and difference ratio of the recommendations
compared to the top-k results. It is arguable whether a diversity-intended algorithm should
maximize the relevancy, since the top-k results will always get the highest score, yet those
have almost no value considering the diversity part. On the other hand, a very low
relevancy score would tell us that the vertices are selected randomly, having no
connection to the query at all.

Since the normalized relevancy does not give us a clear idea of what is expected
from those diversity-intended methods, we compare the set difference of the results
from the top-k relevant recommendations. Fig. 3-b clearly shows that two methods, namely
Feed and Dragon, give result sets that are only 10-15% different from the top-k. In
other words, the results of Feed and Dragon differ in only one element when k = 10.
The experiments also show that the high difference ratio and low rel scores of IL1 and IL2
can be problematic in practice.

Next, we evaluate the algorithms with respect to their usefulness scores. This experiment
shows clearly that IL2 has a very low usefulness compared to the other algorithms,
scoring less than 50% for k > 10, meaning that more than half of its recommendations
are out of the top-10k range (see Fig. 4-a). Dragon, Feed, and the original top-k results
score well on direction-aware goodness (Fig. 4-b); however, this also means that the
goodness measure gives more importance to relevancy and little importance to diversity.

Graph density is frequently used as a diversity measure in the literature [Tong et al.
2011; Li and Yu 2011]. IL1 and IL2 minimize the $\ell$-step graph density by construction
(see Fig. 5), so it would not be fair to compare other methods against these. LM, k-RLM,
and DivRank variants, on the other hand, seem very promising for such a diversity
objective. The same algorithms also perform well on the $\ell$-step expansion ratio
(see Fig. 6), which is related to the coverage of the graph with the recommendations.
GrassHopper and GSparse perform noticeably worse in these diversity metrics.
In particular, their results are denser than the results of DaRWR.

Fig. 4. Scores based on usefulness (a) and goodness (b) measures. Note that the results of DaRWR are
similar to, and hence hidden behind, the results of Feed and Dragon in these plots.

Fig. 5. $\ell$-step graph density of the results. Note that $\mathrm{dens}_\ell$ is 0 for IL1, IL2, and LM at $\ell$ = …
and for IL2 at $\ell$ = 3.

Fig. 6. $\ell$-step expansion ratio of the results.

Fig. 7. Results based on average minimum distance to the query (a), average pairwise shortest distance
between the recommended papers (b), and average publication year (c).



After evaluating the results on various relevancy and diversity metrics, we are left
with only a few methods that performed well on almost all of the measures: LM,
k-RLM, and the DivRank variants. In Fig. 7, however, we observe that the PDivRank and
CDivRank methods give a set of results that are more connected (i.e., have a low
average pairwise distance, see Fig. 7-b) and do not recommend recent publications
(see Fig. 7-c) although the $\kappa$ parameter is set accordingly. Since we are searching
for an effective diversification method that runs on top of DaRWR, the DivRank variants
are no longer good candidates.

4.4. Efficiency 

The running time of the algorithms is also crucial for the web service since all the
recommendations are computed in real time. We run the experiments on the same cluster
that the service is currently using. It has a 2.4GHz AMD Opteron CPU and 4GB of
main memory. The CPU has 64KB L1 and 1MB L2 caches. The DaRWR method and the
dataset are also optimized based on the techniques given in [Küçüktunç et al. 2012a].

As expected, the complexity of the methods based on query refinement depends on, and
increases linearly with, k. As seen in Fig. 8, the GrassHopper, GSparse,
and Feed methods have the longest runtimes, even though they were faster than the
DivRank variants for k < 10. This behavior was also mentioned in [Mei et al. 2010]. The
running time of Dragon is slightly higher than that of LM and k-RLM since it updates the
goodness vector after finding each result.

Fig. 8. Running times of the algorithms for varying k. Note that the running times of DaRWR, IL1, IL2,
LM, and k-RLM are less than the others and very close to each other.

In short, query refinement-based methods (GrassHopper, GSparse, Feed) have
linearly increasing runtimes. The DivRank variants (PDivRank, CDivRank) require
more iterations and, therefore, more time to converge. Finally, Dragon, and especially
LM and k-RLM, are extremely efficient compared to the other methods.

4.5. Selecting the best method for the service 

Our experiments on different relevancy and diversity measures show that

— Feed and Dragon return almost the same result set as top-k, while the graph density
and expansion ratio measures also imply low diversity for their results,

— the results of IL2 have a very low usefulness,

— the results of IL1 have a low relevancy and high difference ratio,

— GrassHopper and GSparse perform worse based on the diversity measures, and

— DivRank variants sacrifice direction-awareness for the sake of diversity.

On the other hand, the LM and k-RLM methods perform well in almost all
experiments, and have a better running time compared to the others. k-RLM is slightly
better than LM since it also improves the relevancy of the set to the query. We display
the results of $\gamma$-RLM with varying $\gamma$ parameters in Fig. 9.




Fig. 9. Results of $\gamma$-RLM with varying parameters.



The evaluations on different metrics show that $\gamma$-RLM is able to sweep through the
search space between all relevant (results of DaRWR) and all diverse (results of LM)
with a varying $\gamma$ parameter. Therefore, this parameter can be set depending on the
data and/or diversity requirements of the application.

5. CONCLUSIONS

In this work, we addressed the diversification of paper recommendations of
theadvisor service, which ranks papers in the literature with a direction-aware
personalized PageRank algorithm. While giving a survey of diversity methods designed
specifically for random walk-based rankings, we adapted those methods to our
direction-aware problem, and proposed some new ones based on vertex selection and
query refinement. Our experiments with various relevancy and diversity measures
show that the proposed $\gamma$-RLM algorithm can be preferred for both its efficiency and
effectiveness.

We also learned from our experiments that if only one relevancy and one diversity measure
are selected to evaluate the results of a diversification method (which is the case for
many studies in this field), a randomized algorithm that returns some of the top-ranked
(relevant) results as well as some other random results will maximize those two measures,
even though the output would be far from satisfactory for the user. Therefore,
the results of such diversification algorithms should be examined with respect to multiple
relevancy, coverage, and difference measures. We believe that we were able to take
steps towards a better evaluation of diversity methods in this paper. Furthermore,
our arguments and conclusions are also valid for cases without a direction-awareness
requirement. As future work, we will investigate the techniques used in this paper
for other applications and discuss how to improve the graph-based diversification
algorithms as well as the evaluation methods.